Linguistically Driven Multi-Task Pre-Training for Low-Resource Neural Machine Translation

Authors

Abstract

In the present study, we propose novel sequence-to-sequence pre-training objectives for low-resource neural machine translation (NMT): Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English. JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is proposed based on phrase structure masking and reordering tasks. Experiments on ASPEC Japanese–English & Japanese–Chinese, Wikipedia Japanese–Chinese, and News English–Korean corpora demonstrate that JASS and ENSS outperform MASS and other existing language-agnostic pre-training methods by up to +2.9 BLEU points for the Japanese–English tasks, up to +7.0 BLEU points for the Japanese–Chinese tasks, and up to +1.3 BLEU points for the English–Korean tasks. Empirical analysis, which focuses on the relationship between individual parts in JASS and ENSS, reveals the complementary nature of the subtasks of JASS and ENSS. Adequacy evaluation using LASER, human evaluation, and case studies reveal that our proposed methods significantly outperform pre-training methods without injected linguistic knowledge and that they have a larger positive impact on adequacy as compared to fluency. We release codes here: https://github.com/Mao-KU/JASS/tree/master/linguistically-driven-pretraining.
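The two kinds of objective named in the abstract, masking and reordering of linguistic units, can be illustrated with a minimal sketch. It assumes the sentence has already been segmented into units (for example bunsetsu, each a list of tokens) by an external tool. The function names `mask_units` and `reorder_units`, the `[MASK]` symbol, and the masking ratio are hypothetical choices made for illustration. This is not the released JASS/ENSS implementation, whose details may differ.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for this sketch


def mask_units(units, ratio=0.5, rng=None):
    """Mask a contiguous run of whole units.

    Returns (encoder_input, decoder_target): the token sequence with the
    selected units replaced by MASK tokens, and the tokens to be predicted.
    """
    if not units:
        raise ValueError("units must be non-empty")
    rng = rng or random.Random()
    # Number of units to mask: at least one, at most all of them.
    n_mask = min(len(units), max(1, round(len(units) * ratio)))
    start = rng.randrange(len(units) - n_mask + 1)
    encoder_input = []
    for i, unit in enumerate(units):
        if start <= i < start + n_mask:
            encoder_input.extend([MASK] * len(unit))
        else:
            encoder_input.extend(unit)
    decoder_target = [tok for unit in units[start:start + n_mask] for tok in unit]
    return encoder_input, decoder_target


def reorder_units(units, rng=None):
    """Shuffle whole units; the model would be trained to restore the order.

    Returns (shuffled_tokens, original_tokens).
    """
    rng = rng or random.Random()
    shuffled = list(units)
    rng.shuffle(shuffled)
    source = [tok for unit in shuffled for tok in unit]
    target = [tok for unit in units for tok in unit]
    return source, target


if __name__ == "__main__":
    # Toy segmentation of a Japanese sentence into three units (assumed, not tool output).
    units = [["私", "は"], ["学校", "に"], ["行く"]]
    print(mask_units(units, rng=random.Random(0)))
    print(reorder_units(units, rng=random.Random(0)))
```

Operating on whole units rather than individual tokens is the point of the sketch: unit boundaries are never split by the mask or the shuffle, which is how linguistic segmentation enters an otherwise generic sequence-to-sequence objective.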

Similar resources

Neural machine translation for low-resource languages

Neural machine translation (NMT) approaches have improved the state of the art in many machine translation settings over the last couple of years, but they require large amounts of training data to produce sensible output. We demonstrate that NMT can be used for low-resource languages as well, by introducing more local dependencies and using word alignments to learn sentence reordering during t...

Pre-Translation for Neural Machine Translation

Recently, the development of neural machine translation (NMT) has significantly improved the translation quality of automatic machine translation. While most sentences are more accurate and fluent than translations by statistical machine translation (SMT)-based systems, in some cases, the NMT system produces translations that have a completely different meaning. This is especially the case when...

Zero-Resource Translation with Multi-Lingual Neural Machine Translation

In this paper, we propose a novel finetuning algorithm for the recently introduced multi-way, multilingual neural machine translation model that enables zero-resource machine translation. When used together with novel many-to-one translation strategies, we empirically show that this finetuning algorithm allows the multi-way, multilingual model to translate a zero-resource language pair (1) as well as a s...

Multilingual Neural Machine Translation for Low Resource Languages

Neural Machine Translation (NMT) has been shown to be more effective in translation tasks than Phrase-Based Statistical Machine Translation (PBMT). However, NMT systems are limited in translating low-resource languages (LRL), because neural methods require a large amount of parallel data to learn effective mappings between languages. In this work we show how so-called mu...

Data Augmentation for Low-Resource Neural Machine Translation

The quality of a Neural Machine Translation system depends substantially on the availability of sizable parallel corpora. For low-resource language pairs this is not the case, resulting in poor translation quality. Inspired by work in computer vision, we propose a novel data augmentation approach that targets low-frequency words by generating new sentence pairs containing rare words in new, syn...

Journal

Journal title: ACM Transactions on Asian and Low-Resource Language Information Processing

Year: 2022

ISSN: 2375-4699, 2375-4702

DOI: https://doi.org/10.1145/3491065